[SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function#4254
fjiang6 wants to merge 31 commits into apache:master from
…ts; added Markdown documentation
…es and making noncritical methods private
Test build #26252 has started for PR 4254 at commit
Test build #26252 has finished for PR 4254 at commit
Test FAILed.
Is it possible to do the Gaussian similarity in another PR? It should be part of feature transformation, not part of PIC. A minimal PR would also be easier to review.
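To illustrate the separation being asked for, here is a minimal sketch of a Gaussian similarity kept apart from the clustering step. It assumes the standard Gaussian (RBF) kernel, exp(-||x - y||^2 / (2 sigma^2)), which may differ in detail from what this PR implements. The object name, method names, and the `sigma` parameter are hypothetical and are not taken from the PR.

```scala
object GaussianSimilaritySketch {
  /** Standard Gaussian (RBF) kernel; `sigma` is a hypothetical bandwidth parameter. */
  def gaussianSimilarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
    require(x.length == y.length, "vectors must have the same dimension")
    require(sigma > 0.0, "sigma must be positive")
    // Squared Euclidean distance between the two points.
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-sqDist / (2.0 * sigma * sigma))
  }

  /** Dense pairwise affinity matrix with a zero diagonal (no self-loops). */
  def affinityMatrix(points: Array[Array[Double]], sigma: Double): Array[Array[Double]] =
    Array.tabulate(points.length, points.length) { (i, j) =>
      if (i == j) 0.0 else gaussianSimilarity(points(i), points(j), sigma)
    }
}
```

With this split, a clustering routine would only need to accept an affinity matrix (or graph) and would not need to know how the similarities were produced.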
Let's not handle I/O here. We can put example code under examples/ and load files there.
You don't need to order vt before calling kmeans.
True, it was sorted for another purpose; will remove.
…iangrui on the PR
Test build #26354 has started for PR 4254 at commit
data/mllib/pic_data.txt
This is too large for unit tests. Unit tests should be as minimal as possible. For this one, we can construct a very small graph, compute its eigenvector, derive the clustering result manually, and then verify the PIC result against it. For example:
a - b - c - g - h
| \ | | \ |
d - e - f i - j
Assign each edge distance 1 and run PIC with k = 2. The solution should be clear.
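A test in this spirit can be sketched without Spark. The sketch below is an assumption-laden illustration, not the PR's `PIClustering` API: it uses a hypothetical, even smaller graph (two triangles joined by a single edge, not the graph drawn above), a fixed non-uniform starting vector so the result is deterministic, and a midpoint threshold on the 1-D embedding as a simplified stand-in for k-means with k = 2. All names are made up for illustration.

```scala
object TinyPicSketch {
  // Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
  val edges: Seq[(Int, Int)] =
    Seq((0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5))
  val n = 6

  /** Row-normalized affinity W = D^-1 A, with weight 1 on every edge. */
  def normalizedAffinity(): Array[Array[Double]] = {
    val a = Array.fill(n, n)(0.0)
    for ((i, j) <- edges) { a(i)(j) = 1.0; a(j)(i) = 1.0 }
    a.map { row => val degree = row.sum; row.map(_ / degree) }
  }

  /** Truncated power iteration: v <- W v, rescaled to unit L1 norm each step. */
  def powerIterate(w: Array[Array[Double]], v0: Array[Double], iters: Int): Array[Double] = {
    var v = v0
    for (_ <- 0 until iters) {
      val next = w.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
      val norm = next.map(x => math.abs(x)).sum
      v = next.map(_ / norm)
    }
    v
  }

  /** Cluster labels for k = 2, using a midpoint split instead of k-means. */
  def cluster(iters: Int = 15): Array[Int] = {
    val w = normalizedAffinity()
    // Deterministic, non-symmetric start vector (assumption; a symmetric start
    // would stay symmetric on this graph and could not separate the triangles).
    val raw = Array.tabulate(n)(i => i + 1.0)
    val v0 = raw.map(_ / raw.sum)
    val v = powerIterate(w, v0, iters)
    val mid = (v.min + v.max) / 2
    v.map(x => if (x < mid) 0 else 1)
  }

  def main(args: Array[String]): Unit =
    println(cluster().mkString(","))
}
```

Stopping after a small, fixed number of iterations matters here: run to convergence, the vector becomes constant and carries no cluster information. On this graph the two triangles end up on opposite sides of the midpoint, so `cluster()` yields labels 0,0,0,1,1,1, which is easy to confirm by hand.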
Test build #26354 has finished for PR 4254 at commit
Test PASSed.
refactor PIC
Test build #26420 has started for PR 4254 at commit
Test build #26423 has started for PR 4254 at commit
Test build #26420 has finished for PR 4254 at commit
Test PASSed.
Test build #26423 has finished for PR 4254 at commit
Test PASSed.
Should use relative path "api/graphx/...". See examples in this markdown file.
LGTM except minor user guide issues, which will be addressed in SPARK-5503. I've merged this into master. Thanks for contributing! (Now MLlib depends on GraphX.)
Add single pseudo-eigenvector PIC
Includes documentation and an updated pom.xml, along with the following code:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala